The exam consists of 4 parts in which you are asked to analyze different datasets. The datasets are included in different R packages, and you need to install the packages to access the data. Your analysis should be done in R, and your answers should be given as R code. For example, if the question asks you to simulate 100 observations from a standard normal distribution, print them, and plot a histogram, your solution should be
x <- rnorm(100, 0, 1)
print(x)
## [1] 1.65362030 0.62683622 0.75022139 1.68739905 -0.26007566 -1.92977840
## [7] 0.06003585 0.55456955 -2.40244690 0.79892616 -2.17961102 0.14076626
## [13] -0.74533343 0.51104061 0.77878205 -0.21054811 -0.24805161 0.17995228
## [19] 0.98353873 -1.25346127 -1.03440501 0.93415047 -0.94877754 -0.66402178
## [25] -0.82485427 -1.49721799 -0.65667812 -1.69673125 -1.88151230 -0.32032451
## [31] -0.64767740 1.15508865 0.81657239 -1.19144632 1.17707527 0.41089329
## [37] -0.19328650 2.11010022 -0.71521787 0.56209185 -0.93160136 0.51602422
## [43] 0.20729464 1.47205930 0.27929390 -0.30594088 0.01695407 0.90540919
## [49] -0.68966337 -0.37077753 2.55806001 -0.14153531 0.77629640 0.18721927
## [55] 0.11841085 -1.24301926 1.70888182 -0.82133509 1.58925107 1.02188926
## [61] -2.27331597 -0.88463852 -0.38904862 0.65800056 -0.88027582 -0.75034992
## [67] -1.48677107 -0.59408482 1.76406761 -1.09354693 0.79361937 0.10908053
## [73] 1.84181033 -0.75735110 -1.21056562 0.93141854 -1.14904874 0.46622605
## [79] -1.03627287 1.33564006 1.50289110 -0.54677696 -0.65563349 -1.59628018
## [85] 0.18095789 -0.25274895 1.62711569 1.08350167 -1.05625222 0.91709850
## [91] 0.99079235 -0.54681488 -0.90975190 -0.31908084 0.07131092 -1.18123858
## [97] -2.16007960 0.94910571 0.05147585 -1.00191322
hist(x)
You do not need to explain your R code. For example, you do not need to write: “the function hist() was used to produce the histogram.” Your answers to the questions should be the R code that you used to produce the output.
You need to submit the following materials:
You do not need to interpret the results!!! For example, if the question is to fit a one-way ANOVA model, you do not need to formulate the model or interpret the results. This means, for example, that you do not need to write “the p-value is 0.007, indicating a significant effect of the factor.”
You should upload your solutions to BB. You will receive information about the submission by email.
The second part of the exam (part 2) will be available online in BB on 13/01/2025 from 08:00 to 11:30.
The oral exam will take place on 13/01/25, 14/01/25, and 15/01/25. You will receive information about your exam date and time by email. The schedule is available online in BB.
For the analysis of this part we use the Hitters data, which is part of the R package ISLR. This is major league baseball data from the 1986 and 1987 seasons. More information can be found at https://rdrr.io/cran/ISLR/man/Hitters.html. The code below can be used to access the data:
library(ISLR)
data(Hitters)
names(Hitters)
## [1] "AtBat" "Hits" "HmRun" "Runs" "RBI" "Walks"
## [7] "Years" "CAtBat" "CHits" "CHmRun" "CRuns" "CRBI"
## [13] "CWalks" "League" "Division" "PutOuts" "Assists" "Errors"
## [19] "Salary" "NewLeague"
In this question we focus on the player’s division at the end of 1986 (the variable Division) and the number of runs in 1986 (the variable Runs).
How many observations are there in each category of the variable Division?
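A minimal base-R sketch (assuming the ISLR package is installed; a dplyr-based solution is equally valid):

```r
# Count the observations in each category of Division
library(ISLR)
data(Hitters)
table(Hitters$Division)
```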
Produce the table below.
## # A tibble: 2 × 2
## Division median_runs
## <fct> <int>
## 1 E 49
## 2 W 46
Conduct a Wilcoxon test for two independent samples to test if the number of runs (the variable Runs) is equal across the divisions.
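One possible solution is the base-R `wilcox.test()` with a formula interface; this is a sketch (assuming ISLR is installed), not the only acceptable call:

```r
# Two-sample Wilcoxon (Mann-Whitney) test of Runs across the two divisions
library(ISLR)
data(Hitters)
wilcox.test(Runs ~ Division, data = Hitters)
```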
Produce Figure 1.1. Note that the points in red are the sample means.
Figure 1.1
In this question we focus on the variable number of walks in 1986 (the variable Walks) in addition to the variables from Q1.
Figure 1.2
Figure 1.3
Figure 1.4
## # A tibble: 2 × 2
## Division Correlation
## <fct> <dbl>
## 1 E 0.745
## 2 W 0.718
\[\mbox{Runs}_{i}=\beta_{0}+\beta_{1} \mbox{Walks}_{i} +\beta_{2} \mbox{Division}_{i} +\varepsilon_{i} \].
Define an R object, fit.coef, in which you store the parameter estimates of the coefficients, and print the object.
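A sketch of one way to fit the model in Q2.3 and store the estimates (assuming ISLR is installed; `lm()` codes the factor Division as a dummy variable automatically):

```r
library(ISLR)
data(Hitters)
# Fit Runs ~ Walks + Division and keep the coefficient estimates
fit <- lm(Runs ~ Walks + Division, data = Hitters)
fit.coef <- coef(fit)
print(fit.coef)
```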
Let \(e_{i}\) be the residual obtained from the regression model in Q2.3, and let \(es_{i}\) be the standardized residual given by \[es_{i}=\frac{e_{i}}{MSE}.\] Check whether the standardized residuals follow a standard normal distribution using the normal QQ plot shown in Figure 1.5.
Figure 1.5
Create a new dataset in which only observations with number of runs in 1986 greater than 30 are included. The following variables should be included in the dataset: Hits, HmRun, Runs, Walks and Division.
How many observations are included in the new dataset?
Sort the new data according to the variable the number of hits in 1986 (the variable Hits). Print the top 5 observations in each division.
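The filtering, sorting, and per-division top-5 steps can be sketched in base R as below (the sorting direction is an assumption; dplyr's `filter()`/`arrange()`/`slice_head()` would work equally well):

```r
library(ISLR)
data(Hitters)
# Keep Runs > 30 and the five requested variables
new_data <- subset(Hitters, Runs > 30,
                   select = c(Hits, HmRun, Runs, Walks, Division))
nrow(new_data)  # number of observations in the new dataset
# Sort by Hits (descending is assumed here) and show the top 5 per division
sorted <- new_data[order(new_data$Hits, decreasing = TRUE), ]
lapply(split(sorted, sorted$Division), head, n = 5)
```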
Export the new dataset that was created in Q3.1 as an Excel file (and include the data in the output that you submit as your solution for the exam).
In this part you need to prepare a presentation of 5 slides, using R Markdown, about the analysis that you conducted in part 1.
In this part of the exam, we focus on the Boston dataset, which is part of the MASS R package. To access the data you need to install the package. More information can be found at https://www.statology.org/boston-dataset-r/. Use the code below to access the data.
library(MASS)
data(Boston)
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
How many observations and variables are included in the dataset? How many missing values, per each variable, are there in the dataset?
Calculate the minimum and maximum for the variables crim, zn and indus across the levels of the variable chas. Produce the panel below.
## # A tibble: 2 × 7
## chas crim_min crim_max zn_min zn_max indus_min indus_max
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0.00632 89.0 0 100 0.46 27.7
## 2 1 0.0150 8.98 0 90 1.21 19.6
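One base-R way to build this panel is `aggregate()` with a function returning both statistics; the tibble above suggests a dplyr `group_by()`/`summarise()` was used, and either approach is acceptable:

```r
library(MASS)
data(Boston)
# Min and max of crim, zn and indus within each level of chas
aggregate(cbind(crim, zn, indus) ~ chas, data = Boston,
          FUN = function(x) c(min = min(x), max = max(x)))
```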
Count the number of homes that are near the Charles River (i.e., observations with chas equal to 1) vs. those that are not near the Charles River (observations with chas equal to 0).
For each level of the variable chas, calculate the average number of rooms per dwelling (the variable rm). Sort the data according to the average number of rooms per dwelling. Print the average number of rooms per dwelling.
Create a new data frame, Boston2, for which the crime rate (the variable crim) is lower than 5 and the proportion of lower-status population (the variable lstat) is lower than 10. How many observations are included in this data frame?
What are the average median home value (the variable medv) and the average number of rooms per dwelling (the variable rm) for the dataset created in Q6.1?
Visualize the relationship between the variables medv and rm for the dataset created in Q6.1 as shown in Figure 6.1.
Figure 6.1
Figure 6.2
Figure 6.3
Define a new categorical variable crim_cat in the following way: Re-code the variable crim into three categories:
crim <5: Low.
crim 5-15: Medium.
crim >15: High.
Count how many observations are included in each category.
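A sketch using `cut()`; the placement of the boundary values 5 and 15 is an assumption (here 5 falls in Medium and 15 in High):

```r
library(MASS)
data(Boston)
# Recode crim into Low / Medium / High; boundary handling is an assumption
Boston$crim_cat <- cut(Boston$crim,
                       breaks = c(-Inf, 5, 15, Inf),
                       labels = c("Low", "Medium", "High"),
                       right  = FALSE)
table(Boston$crim_cat)
```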
Figure 7.1
##
## Low Medium High
## 0 370 71 30
## 1 30 5 0
Figure 7.2
Calculate the correlation between the proportion of Black residents by town (the variable black) and the proportion of lower-status population (the variable lstat) using the R function cor.test().
Use the R package corrplot to produce the heatmap of correlations between variables shown in Figure 8.1. Note that the categorical variables are excluded from the data in this figure.
Figure 8.1
Define a new categorical variable, medv_cat, in the following way: Re-code the variable medv into three categories:
medv < 15: Low.
medv 15-25: Medium.
medv > 25: High.
Include the new variable in the Boston dataset and produce the table below.
## # A tibble: 3 × 4
## medv_cat mean SD N
## <chr> <dbl> <dbl> <int>
## 1 High 35.3 7.88 124
## 2 Low 11.6 2.64 94
## 3 Medium 20.6 2.69 288
Figure 9.1
Test if the means of the variable nox across the three groups of the variable medv_cat are equal using the Kruskal-Wallis test. Print the output from the Kruskal-Wallis test.
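A sketch assuming medv_cat has already been added to Boston as in Q9.1 (the cut points at 15 and 25 follow the Q9.1 definition; boundary handling is assumed):

```r
library(MASS)
data(Boston)
# medv_cat as in Q9.1 (assumed construction)
Boston$medv_cat <- cut(Boston$medv, breaks = c(-Inf, 15, 25, Inf),
                       labels = c("Low", "Medium", "High"))
# Kruskal-Wallis test of nox across the three medv_cat groups
kw <- kruskal.test(nox ~ medv_cat, data = Boston)
print(kw)
```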
Produce Figure 9.2. To make the plot, you can use the function ggline() of the package ggpubr or any other R package/function that you wish. Do not forget to add the error bars around the sample means to the figure.
Figure 9.2
For this question, use the version of the Boston dataset produced in Q9.1.
crim_level: Represents the crime rate divided into three groups:
"Low": Bottom 33% of crime rate values.
"Medium": Middle 33% of crime rate values.
"High": Top 33% of crime rate values.
nox_level: Represents nitric oxide concentration divided into two groups:
"Low": Bottom 50% of nitric oxide concentration values.
"High": Top 50% of nitric oxide concentration values.
Figure 10.1
For subjects with a Low level of crim, conduct a t-test to test the hypothesis that the mean medv is equal between the low and high nitric oxide concentration groups.
Create an R object that contains the \(95\%\) confidence interval for the mean difference. DO NOT use object=c(3.932519,8.891779). Print the object.
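The confidence interval can be pulled out of the `htest` object rather than typed by hand; `boston_q10` below is a hypothetical name for the dataset carrying crim_level and nox_level from Q10:

```r
# t-test of medv between nox_level groups, Low-crim subjects only
# (boston_q10 is a placeholder for the Q10 dataset)
tt <- t.test(medv ~ nox_level, data = subset(boston_q10, crim_level == "Low"))
ci <- tt$conf.int   # the 95% CI for the mean difference
print(ci)
```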
In this part of the exam, the questions focus on the PimaIndiansDiabetes2 dataset, which is part of the mlbench R package. To access the data you need to install the package. More information about the dataset and variables can be found at https://search.r-project.org/CRAN/refmans/mlbench/html/PimaIndiansDiabetes.html. Use the code below to access the data.
library(mlbench)
data(PimaIndiansDiabetes2)
names(PimaIndiansDiabetes2)
## [1] "pregnant" "glucose" "pressure" "triceps" "insulin" "mass" "pedigree"
## [8] "age" "diabetes"
Filter out observations with missing values and define a new dataset: new_PimaIndiansDiabetes2. How many observations are included in the new dataset?
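A minimal sketch using `na.omit()` (assuming mlbench is installed; `complete.cases()` would work equally well):

```r
library(mlbench)
data(PimaIndiansDiabetes2)
# Drop every row that contains at least one missing value
new_PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
nrow(new_PimaIndiansDiabetes2)
```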
For each level of diabetes status (the variable diabetes), identify the top 5 patients with the highest glucose and lowest mass values and produce the object below.
## # A tibble: 10 × 9
## # Groups: diabetes [2]
## pregnant glucose pressure triceps insulin mass pedigree age diabetes
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
## 1 4 197 70 39 744 36.7 2.33 31 neg
## 2 1 193 50 16 375 25.9 0.655 24 neg
## 3 3 191 68 15 130 30.9 0.299 34 neg
## 4 3 180 64 25 70 34 0.271 26 neg
## 5 0 173 78 32 265 46.5 1.16 58 neg
## 6 0 198 66 32 274 41.3 0.502 28 pos
## 7 2 197 70 45 543 30.5 0.158 53 pos
## 8 1 196 76 36 249 36.5 0.875 29 pos
## 9 8 196 76 29 280 37.5 0.605 57 pos
## 10 7 195 70 33 145 25.1 0.163 55 pos
In this question we use the PimaIndiansDiabetes2 dataset.
Remove missing values from the variables glucose and mass and create a new dataset, new_PimaIndiansDiabetes3. How many observations are included in the new dataset?
Create a new variable glucose_level which categorizes the variable glucose as “Low”, “Normal”, or “High” based on quantiles: bottom 25% as Low, 25%-75% as Normal, and top 25% as High.
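One way to sketch the quantile-based recode is `cut()` with `quantile()` break points; the treatment of observations exactly on a quantile boundary is an assumption:

```r
# Assumes new_PimaIndiansDiabetes3 from Q12.1 exists
g <- new_PimaIndiansDiabetes3$glucose
q <- quantile(g, probs = c(0, 0.25, 0.75, 1))
new_PimaIndiansDiabetes3$glucose_level <-
  cut(g, breaks = q, labels = c("Low", "Normal", "High"),
      include.lowest = TRUE)
table(new_PimaIndiansDiabetes3$glucose_level)
```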
Define new R objects for the mean (mean_mass) and standard deviation (sd_mass) of the variable mass within each glucose_level category, and produce the following table:
## # A tibble: 3 × 3
## glucose_level mean_mass sd_mass
## <chr> <dbl> <dbl>
## 1 High 35 6.8
## 2 Low 30.4 6.6
## 3 Normal 32.2 6.8
In this question we use the dataset new_PimaIndiansDiabetes3 that was created in Q12.1.
Figure 13.1
In this question we use the dataset new_PimaIndiansDiabetes3 created in Q12.1.
Create a new variable, age_adjusted_risk, based on the equation below. \[ \mathrm{age\_adjusted\_risk} = \sqrt{\frac{0.5 \times \mathrm{glucose} + 0.3 \times \mathrm{mass} + 0.2 \times \mathrm{pressure}}{\mathrm{age}}}. \]
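The new variable is a direct translation of the formula; a sketch assuming new_PimaIndiansDiabetes3 from Q12.1 is available:

```r
# age_adjusted_risk = sqrt((0.5*glucose + 0.3*mass + 0.2*pressure) / age)
new_PimaIndiansDiabetes3$age_adjusted_risk <- with(new_PimaIndiansDiabetes3,
  sqrt((0.5 * glucose + 0.3 * mass + 0.2 * pressure) / age))
```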
Produce Figure 14.1. Note that the information provided in the title is the output from a two-sample t-test of the adjusted risk (defined in Q14.1) across the diabetes groups (the variable diabetes).
Figure 14.1
Create a function that receives as input: (1) a data frame (data), (2) the name of a categorical variable (group_col), and (3) the names of the numeric variables (numeric_cols). The function output should be a table with the mean, median, standard deviation, and IQR for all the numeric variables in the data frame across the levels of the categorical variable.
Apply the function that you wrote in Q15.1 to the data frame PimaIndiansDiabetes2, using the variable diabetes as the categorical variable and all the numeric variables in the dataset. Print the mean of two numeric variables only (glucose, insulin).
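A base-R sketch of such a function using `aggregate()`; the function name, column names, and output layout are assumptions, and a dplyr `across()`-based version is equally valid:

```r
# Per-group mean, median, SD and IQR for each requested numeric variable
group_summary <- function(data, group_col, numeric_cols) {
  do.call(rbind, lapply(numeric_cols, function(v) {
    agg <- aggregate(data[[v]], by = list(group = data[[group_col]]),
                     FUN = function(x) c(mean   = mean(x,   na.rm = TRUE),
                                         median = median(x, na.rm = TRUE),
                                         sd     = sd(x,     na.rm = TRUE),
                                         IQR    = IQR(x,    na.rm = TRUE)))
    # agg$x is a matrix with one column per statistic
    data.frame(variable = v, group = agg$group, agg$x)
  }))
}
```

For Q15.2 it would be applied as, e.g., `group_summary(PimaIndiansDiabetes2, "diabetes", names(PimaIndiansDiabetes2)[sapply(PimaIndiansDiabetes2, is.numeric)])`.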
Create a new dataset, PimaIndiansDiabetes5, from the original dataset PimaIndiansDiabetes2 and remove missing values.
Using the new dataset created in Q16.1, create another dataset, median_insulin_data, by including only patients over 40 years old (the variable age) with blood pressure above 70 (the variable pressure), and select only patients whose insulin levels (the variable insulin) are above the median insulin level. How many observations are included in the new dataset?
Using the dataset created in Q16.2, create a new dataset, selected_median_insulin, which includes the variables age, pressure, insulin, pedigree, pregnant, and diabetes. Display the last six observations of the dataset.
In this question we use the dataset created in Q16.1 (PimaIndiansDiabetes5).
Create a new dataset filtered_primadia5 by including patients with the variable mass above the median and the variable pedigree above the mean. How many observations are included in the new dataset?
Using the new dataset, produce the animated plot shown in Figure 17.1 (note that the plot is colored by the variable age). You need to view this plot in the HTML file of the exam and produce an identical plot. Note that it should be produced in the HTML version of your solution; in the PDF version it will appear as a static image.
Figure 17.1
Create a new dataset, prima_data, without missing data using PimaIndiansDiabetes2.
How many observations are included in the new dataset?
Recode the variable pregnant in the following way: 0 = “0”, 1 = “1”, 2 = “2” and >= 3 = “3+”. Name the new variable pregnant_grouped. Add the new variable to the dataset prima_data. Print the first 5 observations for whom the number of pregnancies is equal to 2.
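A sketch of the recode with `ifelse()` (a `cut()` or dplyr `case_when()` solution is equally acceptable), assuming prima_data from Q18.1 exists:

```r
# Recode pregnant: 0, 1 and 2 keep their value; 3 or more becomes "3+"
prima_data$pregnant_grouped <- ifelse(prima_data$pregnant >= 3, "3+",
                                      as.character(prima_data$pregnant))
# First 5 observations with exactly 2 pregnancies
head(subset(prima_data, pregnant == 2), 5)
```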
Using the prima_data created in Q18.1, create a new dataset, mean_mass_data, containing the mean of the variable mass for each combination of grouped pregnancy levels (the variable pregnant_grouped) and diabetes status (the variable diabetes). What is the dimension of the new dataset? Print the new dataset.
Produce Figure 18.1 (blood pressure vs. pregnancies).
Figure 18.1
## # A tibble: 17 × 3
## pregnant mean_age sd_age
## <dbl> <dbl> <dbl>
## 1 0 27.6 9.69
## 2 1 27.4 8.11
## 3 2 27.2 9.55
## 4 3 29.0 8.10
## 5 4 32.8 11.0
## 6 5 39.0 12.5
## 7 6 39.3 12.0
## 8 7 41.1 7.93
## 9 8 45.4 10.7
## 10 9 44.2 10.4
## 11 10 42.7 9.37
## 12 11 44.5 6.19
## 13 12 47.4 7.76
## 14 13 44.5 5.84
## 15 14 42 5.66
## 16 15 43 NA
## 17 17 47 NA
Figure 18.2
Figure 18.3
In this question we use the dataset PimaIndiansDiabetes5 that was created in Q16.1.
## pregnancies correlation
## 1 0 0.04483389
## 2 1 0.18459352
## 3 2 0.21800751
## 4 3 0.21480046
## 5 4 -0.03648928
## 6 5 0.19994282
## 7 6 0.44004193
## 8 7 -0.19804490
## 9 8 0.25516453
## 10 9 0.70865117
## 11 10 -0.74739750
## 12 11 -0.02292352
## 13 12 0.17512432
## 14 13 0.93676591
## 15 14 NA
## 16 15 NA
## 17 17 NA
In this question we use the dataset prima_data created in Q18.1.
Figure 20.1